Feature/model editing by ocg-goodfire · Pull Request #417 · goodfire-ai/spd

ocg-goodfire · 2026-02-25T20:56:44Z

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Does this PR introduce a breaking change?

Implements a SLURM-based system for launching parallel Claude Code agents that investigate behaviors in SPD model decompositions.

Key components:
- spd-swarm CLI: Submits SLURM array job for N agents
- Each agent starts isolated app backend (unique port, separate database)
- Detailed system prompt guides agents through investigation methodology
- Findings written to append-only JSONL files (events.jsonl, explanations.jsonl)

New files:
- spd/agent_swarm/schemas.py: BehaviorExplanation, SwarmEvent schemas
- spd/agent_swarm/agent_prompt.py: Detailed API and methodology instructions
- spd/agent_swarm/scripts/run_slurm_cli.py: CLI entry point
- spd/agent_swarm/scripts/run_slurm.py: SLURM submission logic
- spd/agent_swarm/scripts/run_agent.py: Worker script for each job

Also adds SPD_APP_DB_PATH env var support for database isolation.

https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6

Previously used communicate(), which buffers all output until the process completes. Now streams directly to claude_output.txt so you can monitor agent activity with:

tail -f <task_dir>/claude_output.txt

https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6

- Switch to --output-format stream-json for structured JSONL output
- Add --max-turns parameter (default 50) to prevent runaway agents
- Output file changed from claude_output.txt to claude_output.jsonl
- Updated monitoring commands in logs to use jq for parsing

Monitor with:

tail -f task_*/claude_output.jsonl | jq -r '.result // empty'

https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6

Claude Code requires --verbose when using --output-format=stream-json with --print mode. https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6

When multiple GPU-intensive requests are made concurrently (graph computation, optimization, intervention), the backend would hang. This adds a lock that returns HTTP 503 immediately if a GPU operation is already in progress, allowing clients to retry later. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Agents now create and update a research_log.md file with readable progress updates. This makes it easy to follow what the agent is doing and discovering without parsing JSONL files. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Show YYYY-MM-DD HH:MM:SS format and provide tip for getting timestamps. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…rm-lIpTu

The JSON-RPC 2.0 spec requires that the "error" field must NOT be present when there is no error. Our MCPResponse was serializing "error": null in all success responses, causing Claude Code to reject the MCP connection with "Failed to connect" status. Added exclude_none=True to all model_dump() calls so null fields are omitted from the serialized response. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
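To illustrate the effect of the fix, here is a stdlib-only sketch that drops `None` fields before serializing. `RpcResponse` is a hypothetical stand-in for MCPResponse, and the dict comprehension only mimics what pydantic's `model_dump(exclude_none=True)` does at the top level (pydantic also applies it to nested models).

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class RpcResponse:
    """Hypothetical stand-in for the MCPResponse model."""

    id: int
    jsonrpc: str = "2.0"
    result: Optional[Any] = None
    error: Optional[dict] = None


def serialize(resp: RpcResponse) -> str:
    # Omit top-level None fields so a success response has no "error" member
    # and an error response has no "result" member.
    payload = {k: v for k, v in asdict(resp).items() if v is not None}
    return json.dumps(payload)
```

With this, `serialize(RpcResponse(id=1, result={"ok": True}))` produces an object containing `id`, `jsonrpc`, and `result` only.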

…ings together

The backend subprocess had stdout=subprocess.PIPE but the pipe was never drained. When the pipe buffer filled (~64KB), tqdm.write() in the optimization loop would block forever. Fix: Write backend logs to task_dir/backend.log instead of piping. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
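A minimal sketch of the log-file approach, assuming nothing about the real launcher beyond what the message states: `launch_with_log` and `NOISY_CHILD` are hypothetical names. The sketch waits for the child only so the example terminates; a long-running backend would keep the handle open and not wait.

```python
import subprocess
import sys
from pathlib import Path


def launch_with_log(cmd: list[str], log_path: Path) -> int:
    """Run cmd with stdout and stderr appended to log_path; return its exit code."""
    # Passing a real file handle means the OS writes output straight to disk,
    # so the child can never block on a full, undrained pipe.
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        return proc.wait()


# Hypothetical child that prints well past a typical pipe buffer size.
NOISY_CHILD = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 200_000)"]
```

With `stdout=subprocess.PIPE` and no reader, a child like `NOISY_CHILD` would stall once the pipe buffer filled; with a file it runs to completion.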

- SPD_SWARM_TASK_DIR: backend derives db_path, events_path from this
- SPD_SWARM_SUGGESTIONS_PATH: global suggestions file

Removed:
- SPD_APP_DB_PATH, SPD_MCP_EVENTS_PATH, SPD_MCP_TASK_DIR (consolidated)
- Unused AgentOutput schema

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# Conflicts:
#   CLAUDE.md
#   pyproject.toml
#   spd/app/CLAUDE.md
#   spd/app/backend/routers/__init__.py
#   spd/app/backend/routers/intervention.py
#   spd/app/backend/server.py
#   spd/app/backend/state.py
#   spd/app/frontend/src/components/RunView.svelte
#   spd/app/frontend/src/lib/api/index.ts

Reshapes the swarm module into a focused investigation tool where a researcher poses a specific question and a single agent investigates it.

Key changes:
- Rename spd/agent_swarm/ → spd/investigate/, CLI spd-swarm → spd-investigate
- Single SLURM job instead of array, flat output dir structure
- Agent prompt accepts researcher's question + injects model architecture info
- 5 new MCP tools: probe_component, get_component_activation_examples, get_component_attributions, get_model_info, get_attribution_strength
- MCP dispatch refactored from if/elif chain to lookup tables
- Investigations scoped to loaded run via DepLoadedRun
- Frontend: refresh button, @file prompt input, launch-from-UI flow
- Graph artifacts expand to natural size, research log flows with page

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Normalize wandb_path to canonical form (entity/project/run_id) when storing investigation metadata and when filtering. Handles old investigations that stored the "runs/" form. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
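A sketch of what such a normalization might look like. It assumes the old "runs/" form is `entity/project/runs/run_id`, which the message implies but does not spell out; `canonical_wandb_path` is a hypothetical name.

```python
def canonical_wandb_path(path: str) -> str:
    """Return entity/project/run_id, accepting the older entity/project/runs/run_id form."""
    parts = path.strip("/").split("/")
    if len(parts) == 4 and parts[2] == "runs":
        # Assumed legacy form: drop the literal "runs" segment.
        entity, project, _, run_id = parts
    elif len(parts) == 3:
        entity, project, run_id = parts
    else:
        raise ValueError(f"unrecognized wandb path: {path!r}")
    return f"{entity}/{project}/{run_id}"
```

Applying the same function when storing and when filtering is what makes old and new records compare equal.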

…tigations UX

- Run picker: replace hardcoded modelName with fetched arch info (e.g. "SS LlamaSimple 4L d512"), add dataset_short to pretrain_info
- Artifact graphs: use shared graphLayout.ts for canonical layer names, fixing topological grouping (q/k/v rows, gate/up rows)
- Investigations: add launch-from-UI, @file prompt support, refresh button, remove research log scroll trap, scope to loaded run
- Remove layerAliasing.ts; backend now handles concrete→canonical translation
- Drop modelName from registry entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- get_component_info: translate canonical → concrete for harvest/interp lookups, canonicalize correlated component keys in response
- save_graph_artifact: use 'embed' not 'wte' for pseudo-nodes
- get_component_activation_examples: return canonical keys
- Tool descriptions: update examples to canonical format
- ArtifactGraph: prefetch component data on mount for tooltip cards
- Filter both 'wte' and 'embed' as non-interventable nodes
- Remove unused CSS selector in StagedNodesPanel

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Persists alongside other artifacts instead of being tied to a repo checkout. Keyed by run, so multiple runs share the DB safely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add --permission-mode dontAsk and --allowedTools mcp__spd__* to Claude Code launch, preventing use of Bash/Read/Write/Edit and blocking inheritance from ~/.claude/settings.json
- Revert DB path back to .data/app/prompt_attr.db

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add --setting-sources "" to skip all user/project settings (no plugins, no inherited model, no alwaysThinkingEnabled)
- Add --model opus explicitly since global settings are skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three-phase context-aware component labeling using network graph structure:

1. Output pass (late→early): labels what each component does, with downstream neighbor context
2. Input pass (early→late): labels what triggers each component, with upstream + co-firing context
3. Unification: synthesizes output + input labels into unified label

Output and input passes are independent (both layer-serial, but no cross-dependency).

Also extracts shared prompt helpers from dual_view.py into autointerp/prompt_helpers.py, and uses the topology module's CanonicalWeight system for correct layer ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

get_cofiring_neighbors no longer reads from the DB — it returns pure co-firing stats (Jaccard/PMI) with no labels. This ensures the input and output passes have zero logical coupling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
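For reference, the two co-firing statistics named above can be computed from firing positions alone, with no labels involved. This is an illustrative sketch: `cofiring_stats` is a hypothetical helper, the natural log is an assumed convention for PMI, and the set-of-positions representation is chosen for clarity rather than taken from the repo.

```python
import math


def cofiring_stats(a: set[int], b: set[int], n_total: int) -> tuple[float, float]:
    """Return (Jaccard, PMI) for two components given the positions where each fires."""
    both = len(a & b)
    union = len(a | b)
    jaccard = both / union if union else 0.0
    if both == 0:
        # The components never co-fire, so PMI diverges to negative infinity.
        return jaccard, float("-inf")
    p_a = len(a) / n_total
    p_b = len(b) / n_total
    p_ab = both / n_total
    return jaccard, math.log(p_ab / (p_a * p_b))
```

Because the function takes only firing positions, nothing in it can leak labels from one pass into the other.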

- neighbors.py → graph_context.py
- NeighborContext → RelatedComponent
- get_downstream_neighbors → get_downstream_components
- get_upstream_neighbors → get_upstream_components
- get_cofiring_neighbors → get_cofiring_components
- top_k_neighbors → top_k_attributed
- DB columns: neighbor_key → related_key, neighbor_label → related_label

"Neighbours" implied same-layer adjacency; "related components" better conveys the attribution-graph and co-firing relationships.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

If a component failed its output or input pass (e.g. transient API error), the unification pass now logs a warning and skips it instead of asserting and silently deadlocking the async pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Each directional pass now maintains labels_so_far: dict[str, LabelResult] as the scan accumulator. Related components look up labels from this dict instead of querying the DB. The accumulator is seeded from the DB on resume, and the DB is written to for durability, but it is never read mid-scan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
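The accumulator pattern described above can be sketched as follows. The names `directional_pass`, `related`, and `label_fn` are hypothetical, plain strings stand in for LabelResult, and the durable DB write is left as a comment.

```python
from typing import Callable, Optional


def directional_pass(
    order: list[str],
    related: dict[str, list[str]],
    label_fn: Callable[[str, dict[str, str]], str],
    seed: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Label components in scan order, reading context only from the accumulator."""
    # On resume, previously persisted labels seed the accumulator once.
    labels_so_far: dict[str, str] = dict(seed or {})
    for key in order:
        if key in labels_so_far:
            continue  # already labeled in an earlier, resumed run
        # Context comes from labels produced earlier in this scan, never the DB.
        context = {r: labels_so_far[r] for r in related.get(key, []) if r in labels_so_far}
        labels_so_far[key] = label_fn(key, context)
        # A real pipeline would also persist the new label here for durability.
    return labels_so_far
```

Keeping reads inside the dict makes the pass a pure fold over the scan order, which is easier to reason about than interleaved DB reads and writes.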

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…t_circuit, get_ci, find_components_by_examples) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ss improvements

Autointerp:
- Lazy component loading: use get_summary() + per-component get_component() instead of loading all 38k components upfront (was 24min, now instant start)
- Worker pool concurrency in map_llm_calls: bounded job queue replaces semaphore+done_callback+active set. try/finally guarantees sentinel delivery.
- CostTracker is now public; graph_interp shares one across all 3 passes
- Resume support: --autointerp_subrun_id to continue from existing subrun
- Add Pile dataset description to compact_skeptical strategy

Postprocess:
- Add graph_interp to PostprocessConfig (with attributions validation)
- Add --dependency flag to spd-postprocess CLI (argparse, not fire)
- Thread dependency_job_id through submit_harvest
- Add s-82ffb969 postprocess config

Other:
- Harvest config: activation_examples_per_component 1000 -> 400
- Harvest DB: remove stale debug logging
- App TODO: remove resolved SQLite immutable audit item
- App + harvest + autointerp repo changes (pre-existing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
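The worker-pool change in map_llm_calls might look roughly like the sketch below. It is an assumption-laden illustration of the general pattern (bounded queue, fixed workers, sentinels sent from a `finally` block), not the repo's implementation; `map_with_workers` and its parameters are hypothetical.

```python
import asyncio

_DONE = object()  # sentinel telling a worker to exit


async def map_with_workers(items, fn, n_workers: int = 4, max_queue: int = 8) -> list:
    """Apply async fn to items with bounded concurrency; results keep input order."""
    jobs: asyncio.Queue = asyncio.Queue(maxsize=max_queue)
    results: dict = {}

    async def producer() -> None:
        try:
            for i, item in enumerate(items):
                # put() blocks when the queue is full, which bounds memory use.
                await jobs.put((i, item))
        finally:
            # Always deliver one sentinel per worker, even if iteration fails,
            # so no worker waits forever on an empty queue.
            for _ in range(n_workers):
                await jobs.put(_DONE)

    async def worker() -> None:
        while True:
            job = await jobs.get()
            if job is _DONE:
                return
            i, item = job
            try:
                results[i] = await fn(item)
            except Exception as exc:  # record the failure and keep the pool alive
                results[i] = exc

    await asyncio.gather(producer(), *(worker() for _ in range(n_workers)))
    return [results[i] for i in range(len(results))]
```

Compared with a semaphore plus done-callbacks, shutdown here is explicit: every worker exits by consuming exactly one sentinel.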

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The reader (InterpRepo.open) uses ?immutable=1 which gives a frozen snapshot — fine for NFS but breaks if the DB is still being written. Rather than fighting SQLite's NFS limitations, just gate the reader on a .done marker that the writer creates after finishing. Backfilled existing subruns with .done markers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
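A minimal sketch of gating an immutable read on a completion marker. It assumes the marker is a file named `.done` next to the database, which the message does not specify; `mark_done` and `open_readonly` are hypothetical names.

```python
import sqlite3
from pathlib import Path


def mark_done(subrun_dir: Path) -> None:
    """Writer side: create the marker only after the final commit and close."""
    (subrun_dir / ".done").touch()


def open_readonly(db_path: Path) -> sqlite3.Connection:
    """Reader side: refuse to open until the writer has marked the DB complete."""
    marker = db_path.parent / ".done"
    if not marker.exists():
        raise FileNotFoundError(f"{db_path} is still being written (no {marker.name} marker)")
    # immutable=1 tells SQLite the file will not change, so it skips locking
    # and change detection. That is only safe once the writer has finished.
    uri = db_path.resolve().as_uri() + "?immutable=1"
    return sqlite3.connect(uri, uri=True)
```

The marker turns "is the DB finished?" into a plain file-existence check, which behaves predictably on NFS.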

- compute_masked_predictions() in compute.py: CI, stochastic (binomial), and adversarial (PGD) forward passes with top-k predictions
- Masked predictions computed during intervention forward ("bake") and persisted in intervention_runs table alongside the intervention result
- Base run auto-bakes on entering interventions view (all interventable nodes)
- Sticky prediction rows bar at top of graph, zoom controls moved to bottom-left
- Eval PGD inputs (steps/step_size) in intervention controls toolbar
- Run registry endpoint + table-based RunSelector with data availability columns
- Minor graph_interp, harvest, and frontend cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tion runs

…add loss context

- Fix adversarial PGD bug: was using selected nodes as alive masks (no-op since selected have CI=1). Now uses graph's CI>0 alive masks so PGD optimizes alive-but-unselected components.
- Merge compute_intervention_forward + compute_masked_predictions into single compute_intervention returning InterventionResult with CI, stochastic, and adversarial top-k predictions + per-regime loss.
- Add loss context: standard graphs use MeanKLLossConfig, optimized graphs use their CE/KL loss. Both PGD objective and reported metrics match.
- Add MeanKLLossConfig, PositionalLossConfig types. Make loss_config required on run_adv_pgd (remove None fallback).
- Remove adv_pgd_out_logits from graph storage (intervention-level, not graph-level). Remove adv_pgd fields from OutputProbability/OutputNodeCard.
- Add build_graph_alive_masks helper with validation asserts.
- Simplify intervention router: single result column, remove fork endpoints.
- Add prediction chip hover tooltips in InterventionsView.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Target-sans (T\S) shows the full target model with only the unselected alive nodes ablated — answering "what do the deselected components contribute?" Complements CI (only selected active) by showing the inverse view. Mask: everything=1 except alive-but-unselected=0. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
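The two mask regimes contrasted above can be written out directly. This sketch uses per-node dictionaries for readability; the real code presumably works on tensors, and `ci_mask` and `target_sans_mask` are hypothetical names.

```python
def ci_mask(all_nodes: set[str], selected: set[str]) -> dict[str, float]:
    """CI regime: only the selected nodes are active."""
    return {n: 1.0 if n in selected else 0.0 for n in all_nodes}


def target_sans_mask(all_nodes: set[str], alive: set[str], selected: set[str]) -> dict[str, float]:
    """Target-sans regime: everything is on except alive-but-unselected nodes."""
    return {n: 0.0 if (n in alive and n not in selected) else 1.0 for n in all_nodes}
```

Nodes that are not alive stay at 1 in the target-sans mask, so the forward pass differs from the full target model only by the deselected alive components.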

With weight deltas, components + delta = exact target model output. Without them, all-ones mask has small reconstruction error vs target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Exact target reconstruction requires full precision. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add LabelPredictions to InterventionResult: per-regime prediction stats for the CE-optimized token. Frontend highlights matching topk chips with amber border, and appends a dashed-border chip when the label token falls outside topk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Alive masks (the PGD degrees of freedom and target-sans ablation set) should always be the model's full binarized CI — not the graph's potentially sparse optimized CI. compute_intervention now recomputes natural CI internally via one forward pass + calc_causal_importances. Removes build_graph_alive_masks (no longer needed) and the graph_alive_masks parameter from compute_intervention. Callers now pass sampling type instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ferring

The frontend computes sans_nodes = allGraphNodes - selectedNodes and sends it in the request. This avoids the backend needing to know about graph-level vs natural-CI-level alive sets. Base intervention runs pass sans_nodes=[] (nothing to ablate). MCP ablation tool also passes sans_nodes=[].

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sans_nodes: list | None (no default). Callers pass None explicitly when target-sans isn't needed (base intervention runs, MCP). Frontend sends sans_nodes only when the user has deselected nodes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ated

- Spotlight mode: when "hide unpinned edges" is on and hovering a node (no pinned nodes), hide all edges and non-connected nodes; show only graph neighbors colored by edge polarity with strength-based opacity and grey outline
- Refactor node rendering from parallel booleans to a global InteractionMode state machine (spotlight | focusing | resting) with O(1) per-node role lookup via getNodeRole()
- Rename sans_nodes → nodes_to_ablate, target_sans → ablated across full stack (backend, frontend, DB JSON migration)
- Fix ablated tooltip (was describing necessity, actually measures sufficiency)
- Delete orphaned graph with missing base intervention run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Compute both signed and abs-target edge variants in a single backward pass using the analytical identity ∂|y|/∂x = sign(y)·∂y/∂x (valid because the app's target is a single scalar per component, unlike dataset attributions where abs breaks the grad(sum)=sum(grad) batch reduction trick). Backend: nullable edges_data_abs column on graphs table (backwards compat), abs edges flow through compute → DB → API alongside signed edges. Frontend: "Edge Variant" radio toggle in display settings (Signed / Abs Target), getActiveEdges() helper selects the active variant for graph canvas, spotlight mode, node cards, and intervention views. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
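The identity follows from the chain rule: wherever y ≠ 0, |y| = sign(y)·y with sign(y) locally constant, so the abs-target gradient is the signed gradient with its sign flipped where y is negative. A small numeric check, using an arbitrary scalar function chosen purely for illustration (none of these names come from the repo):

```python
import math


def y(x: float) -> float:
    """Arbitrary scalar target; negative at x = 1, positive at x = 2."""
    return x**3 - 2.0 * x


def dy_dx(x: float) -> float:
    """Analytic derivative of y, standing in for the signed backward pass."""
    return 3.0 * x**2 - 2.0


def abs_grad_from_signed(x: float) -> float:
    # Reuse the signed gradient: d|y|/dx = sign(y) * dy/dx (valid where y != 0).
    return math.copysign(1.0, y(x)) * dy_dx(x)


def finite_diff(f, x: float, h: float = 1e-6) -> float:
    """Central finite difference, used only to verify the identity."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

At x = 1 the signed gradient is 1 while the abs-target gradient is -1, and both agree with finite differences of |y|. The identity says nothing at y = 0, where |y| is not differentiable.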

Delete one-off scripts, investigation logs, reference papers, and editing experiment configs that shouldn't be in the repo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Treat SPD components as rank-1 LoRA adapters and train specific V columns (read vectors) and/or U rows (write vectors) on arbitrary losses. Uses gradient hooks for per-column/row masking and snapshotted weight deltas so the model starts from exact target-model behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude and others added 30 commits January 30, 2026 20:40

Fix stream-json output requiring --verbose flag

ef5b0fd

Claude Code requires --verbose when using --output-format=stream-json with --print mode. https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6

Add full timestamps to research log examples

4c4a843

Show YYYY-MM-DD HH:MM:SS format and provide tip for getting timestamps. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/dev' into claude/slurm-agent-swa…

dcb28f4

…rm-lIpTu

wip: Integrate agent swarm with MCP for Claude Code tool access

cb6e6f0

wip: Refactor agent swarm MCP configuration to require all swarm sett…

39b5acb

…ings together

wip: Add graph artifacts to investigation research logs

b47733f

Fix investigation wandb_path matching

1b30e81

Normalize wandb_path to canonical form (entity/project/run_id) when storing investigation metadata and when filtering. Handles old investigations that stored the "runs/" form. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Move app DB from repo-local .data/ to SPD_OUT_DIR/app/

474e2f3

Persists alongside other artifacts instead of being tied to a repo checkout. Keyed by run, so multiple runs share the DB safely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Editing and autointerp

eb480c4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add kaleido dep and __main__ to run_interpret

9ab56d9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace editing.py with editing/ package (adds optimize_circuit, prin…

7f0cb44

…t_circuit, get_ci, find_components_by_examples) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ocg-goodfire and others added 18 commits March 2, 2026 11:50

Add *.schema.json to gitignore

e9b85cd

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tidy up registry

556265f

Add missing vals from db query

feaef8b

wip: Add KL metrics to masked predictions and auto-save base interven…

2d0cc5d

…tion runs

remove scratch files

a65bee1

Add back clustering to registry.ts

eb908aa

Fix target-sans to use weight deltas for exact target reconstruction

c1d2e05

With weight deltas, components + delta = exact target model output. Without them, all-ones mask has small reconstruction error vs target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Disable bf16 autocast for target-sans forward pass

abbca53

Exact target reconstruction requires full precision. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Revert bf16 removal — weights are already bf16, autocast is irrelevant

82c73d8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ocg-goodfire mentioned this pull request Mar 3, 2026

Rewrite dataset attribution storage + 3 metrics #413

Closed

4 tasks

ocg-goodfire and others added 10 commits March 4, 2026 14:23

Update CLAUDE.md with clustering info

665b0e2

App UI improvements and minor fixes

86fb6c7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wip.

3a8fc31

Allow for deleting prompts

f158150

Remove scratch files before merge to dev

57e1373

Delete one-off scripts, investigation logs, reference papers, and editing experiment configs that shouldn't be in the repo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Always use AllLayers mask for PPGD warmup

3ce38d0

Add n_samples to PPGD config

cd8fb39

ocg-goodfire closed this Mar 19, 2026

ocg-goodfire wants to merge 113 commits into dev from feature/model-editing

4 participants